Low Latency RNN Inference with Cellular Batching

نویسندگان

  • Pin Gao
  • Lingfan Yu
  • Yongwei Wu
  • Jinyang Li
چکیده

Performing inference on pre-trained neural network models must meet the requirement of low-latency, which is often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, which do not perform well when serving Recurrent Neural Networks with dynamic dataflow graphs. We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN “cell” (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves much lower latency and also higher throughput than existing systems.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Tuning Paxos for High-Throughput with Batching and Pipelining

Paxos is probably the most known state machine replication protocol. Two optimizations that can greatly improve its performance are batching and pipelining. Their effectiveness depends significantly on the system properties, mainly network latency and bandwidth, but also on the CPU speed and properties of the application. This makes it hard to know when and how to use each optimization to achie...

متن کامل

Low Latency Anonymity with Mix Rings

We introduce mix rings, a novel peer-to-peer mixnet architecture for anonymity that yields low-latency networking compared to existing mixnet architectures. A mix ring is a cycle of continuous-time mixes that uses carefully coordinated cover traffic and a simple fan-out mechanism to protect the initiator from timing analysis attacks. Key features of the mix ring architecture include decoupling ...

متن کامل

Neural Speed Reading via Skim-RNN

Inspired by the principles of speed reading, we introduce Skim-RNN, a recurrent neural network (RNN) that dynamically decides to update only a small fraction of the hidden state for relatively unimportant input tokens. Skim-RNN gives computational advantage over an RNN that always updates the entire hidden state. Skim-RNN uses the same input and output interfaces as a standard RNN and can be ea...

متن کامل

Low Power and Low Latency Phase-‎Frequency Detector in Quantum-Dot ‎Cellular Automata Nanotechnology

   Nowadays, one of the most important blocks in telecommunication circuits is the frequency synthesizer and the frequency multipliers. Phase-frequency detectors are the inseparable parts of these circuits. In this paper, it has been attempted to design two new structures for phase-frequency detectors in QCA nanotechnology. The proposed structures have the capability of detecting the phase ...

متن کامل

Building a Transparent Batching Layer for Storm

 Building a Transparent Batching Layer for Storm Matthias J. Sax, Malu Castellanos HP Laboratories HPL-2013-69 streaming data, distributed streaming system, batching, performance, optimization Storm is a distributed intra-node-parallel stream processing system built for very low latency processing. One major drawback of Storm is its relatively low throughput. In order to increase Storm's throu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2018